This post is built for Assignment of ISSS608 Visual Analytics and Application.
This assignment topic is based on Mini-Challenge 2 of VAST Challenge 2021, and there is background introduction of VAST Challenge 2021.
In the roughly twenty years that Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos, it has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been as successful in demonstrating environmental stewardship.
In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearance, but things may not be what they seem.
The data used in this research is the movement and tracking data of employees. GAStech provides many of their employees with company cars for their personal and professional use, but unbeknownst to the employees, the cars are equipped with GPS tracking devices.
Based on the given tracking data for the two weeks leading up to the disappearance, as well as credit card transactions and loyalty card usage data, we aim to use appropriate visually driven data analysis techniques to complete two objectives:
Identify anomalies and suspicious behaviors.
Identify the owners of each credit and loyalty card.
Data visualization is the practice of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand and pull insights from. The main goal of data visualization is to make it easier to identify patterns, trends and outliers in large data sets. It provides a quick and effective way to communicate information in a universal manner using visual information.
In this research, the two main data visualization methods used to attain research goals and answer questions are graphs and maps. Graphs include stable graphs and interactive graphs and the main R packages used are ggplot and plotly. The ggplot package is the most popular data visualization package in the R community. It was created by Hadley Wickham in 2005. It was implemented based on Leland Wilkinson’s Grammar of Graphics — a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers. While using ggplot2, we provide the data, call specific function, map our desired variables to aesthetics, define graphical arguments. The plotly is a R package for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js.
The code chunks below are used to install and load the packages in R.
packages = c("tidyverse","tidyr","tidyverse","dplyr","sp","raster","sf",
"vctrs","clock","tmap","rgdal","readr","ggplot2","plotly","tmap","gganimate","av","gifski",
"igraph","tidygraph","ggraph","visNetwork","lubridate","DT","collapsibleTree")
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The data was imported by using read.csv().
carassign <- read_csv("./MC2/car-assignments.csv")
creditcard <- read.csv("./MC2/cc_data.csv")
gps <- read_csv("./MC2/gps.csv")
loyaltycard <- read.csv("./MC2/loyalty_data.csv")
Abila_st <- st_read(dsn = "MC2/Geospatial",
layer = "Abila")
Reading layer `Abila' from data source
`D:\liyuhong0110\BlogFile\_posts\2021-07-05-assignment\MC2\Geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
Change the data type of MC2-tourist.jpg to tif by save as. (Figure 1)
Figure 1
View the data by the following coding.
head(carassign)
# A tibble: 6 x 5
LastName FirstName CarID CurrentEmploymentType CurrentEmploymentTit~
<chr> <chr> <dbl> <chr> <chr>
1 Calixto Nils 1 Information Technolo~ IT Helpdesk
2 Azada Lars 2 Engineering Engineer
3 Balas Felix 3 Engineering Engineer
4 Barranco Ingrid 4 Executive SVP/CFO
5 Baza Isak 5 Information Technolo~ IT Technician
6 Bergen Linnea 6 Information Technolo~ IT Group Manager
head(creditcard)
timestamp location price last4ccnum
1 01/06/2014 07:28 Brew've Been Served 11.34 4795
2 01/06/2014 07:34 Hallowed Grounds 52.22 7108
3 01/06/2014 07:35 Brew've Been Served 8.33 6816
4 01/06/2014 07:36 Hallowed Grounds 16.72 9617
5 01/06/2014 07:37 Brew've Been Served 4.24 7384
6 01/06/2014 07:38 Brew've Been Served 4.17 5368
head(gps)
# A tibble: 6 x 4
Timestamp id lat long
<chr> <dbl> <dbl> <dbl>
1 01/06/2014 06:28 35 36.1 24.9
2 01/06/2014 06:28 35 36.1 24.9
3 01/06/2014 06:28 35 36.1 24.9
4 01/06/2014 06:28 35 36.1 24.9
5 01/06/2014 06:28 35 36.1 24.9
6 01/06/2014 06:28 35 36.1 24.9
head(loyaltycard)
timestamp location price loyaltynum
1 01/06/2014 Brew've Been Served 4.17 L2247
2 01/06/2014 Brew've Been Served 9.6 L9406
3 01/06/2014 Hallowed Grounds 16.53 L8328
4 01/06/2014 Coffee Shack 11.51 L6417
5 01/06/2014 Hallowed Grounds 12.93 L1107
6 01/06/2014 Brew've Been Served 4.27 L4034
The overview of data are shown in Table 1.
Table 1
| S/N | Name of Data | Number of Observations | Variables |
|---|---|---|---|
| 1 | carassign | 44 | LastName, FirstName, CarID, CurrentEmploymentType, CurrentEmploymentTitle |
| 2 | creditcard | 1490 | timestamp, location, price, last4ccnum |
| 3 | gps | 685,169 | Timestamp, id, lat, long |
| 4 | loyaltycard | 1392 | timestamp, location, price, loyaltynum |
The columns of “timestamp” should be Date data type instead of character, so we change the data type as the following codes.
# Change data type of columns "timestamp"
creditcard$timestamp <- date_time_parse(creditcard$timestamp,
zone="",
format="%m/%d/%Y %H:%M")
loyaltycard$timestamp <- date_time_parse(loyaltycard$timestamp,
zone="",
format="%m/%d/%Y")
gps$Timestamp <- date_time_parse(gps$Timestamp,
zone="",
format="%m/%d/%Y %H:%M")
There are some incorrect columns shown in Figure 2 in the dataset of creditcard and loyaltycard.
Figure 2
For loyaltycard dataset, we correct this problem by the following codes and get the dataset of loyaltycard_final dataset which will be used in the following analysis.
# Build the wrong loyaltycard dataset
wrongloyalty <- loyaltycard %>%
filter(loyaltynum=="")
wrongloyalty <- wrongloyalty[,-4]
names(wrongloyalty)[3] <-"loyaltynum"
wrongloyalty <- wrongloyalty %>%
separate(location,into = c("location","price"),sep ="\\?",convert = TRUE)
head(wrongloyalty)
timestamp location price loyaltynum
1 2014-01-06 Katerina’s Caf 33.54 L6110
2 2014-01-06 Katerina’s Caf 38.65 L6110
3 2014-01-06 Katerina’s Caf 26.46 L8328
4 2014-01-06 Katerina’s Caf 28.00 L6267
5 2014-01-06 Katerina’s Caf 26.51 L9406
6 2014-01-07 Katerina’s Caf 37.44 L2343
# Remove Null rows
loyaltycard <- loyaltycard %>%
filter(loyaltynum!="")
# Change the data type of "price" column
loyaltycard$price <- as.numeric(loyaltycard$price)
class(loyaltycard$price)
[1] "numeric"
# Combine wrongloyaltycard and loyaltycard dataset to build final loyalty card dataset
loyaltycard_final <- rbind(loyaltycard,wrongloyalty)
After completing the same actions in the creditcard dataset as following, we got the dataset of creditcard_final dadtaset.
# Build the wrong creditcard dataset
wrongcredit <- creditcard %>%
filter(is.na(last4ccnum))
wrongcredit <- wrongcredit[,-4]
names(wrongcredit)[3] <-"last4ccnum"
wrongcredit <- wrongcredit %>%
separate(location,into = c("location","price"),sep ="\\?",convert = TRUE)
head(wrongcredit)
timestamp location price last4ccnum
1 2014-01-06 12:56:00 Katerina’s Caf 13.68 8332
2 2014-01-06 13:28:00 Katerina’s Caf 33.54 8411
3 2014-01-06 13:46:00 Katerina’s Caf 18.56 6899
4 2014-01-06 13:50:00 Katerina’s Caf 13.59 1874
5 2014-01-06 13:50:00 Katerina’s Caf 32.64 9617
6 2014-01-06 13:53:00 Katerina’s Caf 36.95 2142
FirstName and LastName columns can be combined to build a new column in the carassign dataset, and then a new dataset called carassign_final was built.
# Combine FirstName and LastName columns
carassign_final <- carassign %>%
unite(Name,c("FirstName","LastName"), sep = " ")
Then we check the missing values and view the wrangled dataset.
integer(0)
integer(0)
[1] 36 37 38 39 40 41 42 43 44
integer(0)
# Check the dataset
head(loyaltycard_final)
timestamp location price loyaltynum
1 2014-01-06 Brew've Been Served 4.17 L2247
2 2014-01-06 Brew've Been Served 9.60 L9406
3 2014-01-06 Hallowed Grounds 16.53 L8328
4 2014-01-06 Coffee Shack 11.51 L6417
5 2014-01-06 Hallowed Grounds 12.93 L1107
6 2014-01-06 Brew've Been Served 4.27 L4034
head(creditcard_final)
timestamp location price last4ccnum
1 2014-01-06 07:28:00 Brew've Been Served 11.34 4795
2 2014-01-06 07:34:00 Hallowed Grounds 52.22 7108
3 2014-01-06 07:35:00 Brew've Been Served 8.33 6816
4 2014-01-06 07:36:00 Hallowed Grounds 16.72 9617
5 2014-01-06 07:37:00 Brew've Been Served 4.24 7384
6 2014-01-06 07:38:00 Brew've Been Served 4.17 5368
head(carassign_final)
# A tibble: 6 x 4
Name CarID CurrentEmploymentType CurrentEmploymentTitle
<chr> <dbl> <chr> <chr>
1 Nils Calixto 1 Information Technology IT Helpdesk
2 Lars Azada 2 Engineering Engineer
3 Felix Balas 3 Engineering Engineer
4 Ingrid Barranco 4 Executive SVP/CFO
5 Isak Baza 5 Information Technology IT Technician
6 Linnea Bergen 6 Information Technology IT Group Manager
head(gps)
# A tibble: 6 x 4
Timestamp id lat long
<dttm> <dbl> <dbl> <dbl>
1 2014-01-06 06:28:00 35 36.1 24.9
2 2014-01-06 06:28:00 35 36.1 24.9
3 2014-01-06 06:28:00 35 36.1 24.9
4 2014-01-06 06:28:00 35 36.1 24.9
5 2014-01-06 06:28:00 35 36.1 24.9
6 2014-01-06 06:28:00 35 36.1 24.9
We can see that there are missing values (Figure 3) in rows from 36 to 44 in carassign_final dataset. It means that cardID is lack for truck drivers.
Figure 3
After observing we find that only car ID from 1 to 35 will be used, so car ID included other number Figure 4 will be excluded from the analysis of gps dataset.
Figure 4
There are two dataset can be used to identify the popular locations, and after obversing the relationship of two dataset we find that there is some purchase was recorded in both dataset due to the same purchase time and purchase price. Hence, we will be based on loyaltycard_final and creditcard_final respectively.
To identify the the popular locations based on loyalty card dataset, we use the following steps.
# Calculate the number of locations appeared based on loyalty card
location_num_loyaltycard <- loyaltycard_final %>%
group_by(location) %>%
summarise(num_loyalty=n()) %>%
arrange(desc(num_loyalty))
# Plot the number of locations appeared based on loyalty card
p1 <- plot_ly(data = location_num_loyaltycard,
x = ~location,y = ~num_loyalty,
type="bar",
text = ~paste("Location:", location,
"<br>Frequency:", num_loyalty)) %>%
layout(yaxis = list(title = 'Frequency'),
xaxis = list(title = 'Location',
categoryorder = "array",
categoryarray = ~num_loyalty),
title = 'The Frequency of Each Location Based on Loyalty Card')
p1
The graph shows that the five most popular locations is: Katerina’s Caf, Hippokampos, Guy’s Gyros, Brew’ve Been Served, and Hallowed Grounds (Figure 5).

Figure 5
To identify the the popular locations based on credit card dataset, we use the following steps.
# Calculate the number of locations appeared based on credit card
location_num_creidtcard <- creditcard_final %>%
group_by(location) %>%
summarise(num_credit=n()) %>%
arrange(desc(num_credit))
# Plot the number of locations appeared based on credit card
p2 <- plot_ly(data = location_num_creidtcard,
x = ~location,y = ~num_credit,
type="bar",
text = ~paste("Location:", location,
"<br>Frequency:", num_credit)) %>%
layout(yaxis = list(title = 'Frequency of happened purchase'),
xaxis = list(title = 'Location',
categoryorder = "array",
categoryarray = ~num_credit),
title = 'The Frequency of Each Location Based on Credit Card')
p2
The graph shows that the five most popular locations is: Katerina’s Caf, Hippokampos, Guy’s Gyros, Brew’ve Been Served, and Hallowed Grounds (Figure 6), which is the same as the result based on loyalty card dataset.

Figure 6
The hour and minute was recorded in Timestamp columns in creditcard dataset, but not in loyalty dataset. Hence, we will figure out the popular days based on loyalty dataset, and find the pupular moment based on creditcard dataset.
Write the following codes to draw the graph for the five most popular locations.
# Build new dataset for the five most popular locations
loyalty_day <- loyaltycard_final %>%
filter(location==c("Katerina’s Caf", "Hippokampos", "Guy's Gyros",
"Brew've Been Served", "Hallowed Grounds"))
# Build a new column for day and change the data type
loyalty_day$day <- as.numeric(get_day(loyalty_day$timestamp))
# Wrangle the dataset for the frequency of purchase for every day
loyalty_day_num <- loyalty_day %>%
group_by(day) %>%
summarise(frequency=n())
# Draw the graph
p3 <- plot_ly(data = loyalty_day_num,
x = ~day,
y = ~frequency,
type = 'scatter',
mode = 'lines',
text = ~paste("Day:", day,
"<br>Frequency:", frequency),
name="Trend") %>%
layout(yaxis = list(title = 'Frequency'),
xaxis = list(title = 'Day'),
title = "Frequency of Loyalty Card Purchase for Five Most Pupular Locations") %>%
add_trace(mode = 'markers',
name="Detail")
p3
The line graph shows that 2014/1/6, 2014/1/16 and 2014/1/17 are the most popular days for the five most popular locations due to the peak that appeared in these dates.
Write the following codes to draw the 2D Histograms graph for the 20 most popular locations.
# Build new dataset for the five most popular locations
credit_hour <- creditcard_final %>%
filter(location==c("Katerina’s Caf", "Hippokampos", "Guy's Gyros",
"Brew've Been Served", "Hallowed Grounds"))
# Build a new column for day and change the data type
credit_hour$hour <- as.numeric(get_hour(credit_hour$timestamp))
# Wrangle the dataset for the frequency of purchase for every day
credit_hour_num <- credit_hour %>%
group_by(location,hour) %>%
summarise(frequency=n())
# Draw the graph
p4 <- ggplot(data=credit_hour_num,
aes(x=location,y=hour,fill=frequency))+
geom_tile()+
scale_y_continuous(limits = c(0,24),
breaks = c(0,2,4,6,8,10,12,14,16,18,20,22,24))+
scale_fill_gradient2(low="blue",
high="red")+
labs(x="Location",
y="Hour",
title="Frequency of Credit Card Purchase for Five Most Popular Locations")+
theme(plot.title=element_text(hjust=0.5))
p4

Based on the above graph, we can find that 7:00-8:00 is popular for Brew’ve Been Served and Hallowed Grounds, and 13:00-14:00 and 19:00-20:00 is popular for Katerina’s Caf, Hippokampos, Guy’s Gyros.
Based on the frequency of purchase, we knew that there are a cycle of low frequency of purchase for the five most popular locations, and the time interval was 5 days. We can speculate that 2014/1/11, 2014/1/12, 2014/1/18 and 2014/1/19 are weekends. And after checking the calender, we can confirm our guess. Hence, we can conclude that these locations were the important locations for GAStech employees to break, have meals and work in weekdays, and employees went to these location less in weekends. The coffee shops provided suitable and comfortable environment for the GAStech employees who leaf companies.
Figure 7
Figure 8
Figure 9
GAStech should improve efficiency and arrange daytime reasonably to reduce overtime work and keep employees healthy.
GAStech should provide more comfortable restaurants for employees within the area of company, which will be more convenient for employees to have their meals nearby instead of driving to have meals.
For some necessary overtime work, the company should provide comfortable working space in the evening.
Using gps dataset, we can get maps of GPS paths and stop-by points for each car.
Firstly, plot the map of Kronos and build the dataset of GPS path.
# Import MC2-tourist.tif
bgmap <- raster("MC2/MC2-tourist.tif")
# Tidy up dataset
gps_final1 <- gps
gps_final1$id <- as_factor(gps_final1$id)
gps_final1$day <- as.factor(get_day(gps_final1$Timestamp))
gps_sf <- st_as_sf(gps_final1,
coords=c("long","lat"),
crs=4326)
# Build the dataset of GPS path
gps_path <- gps_sf %>%
group_by(id,day) %>%
summarize(m=mean(Timestamp),
do_union=FALSE) %>%
st_cast("LINESTRING")
And then we should figure out the stop-by points. In this part, the stop-by points mean that the person stops the cars for a period larger than 300 seconds (5 minutes). Start time means the start time of stopping cars and end time means the end time of stopping cars (restart cars and move forward). Time different (tdiff) is calculated by End time-Start time, and the stops of cars with a period larger than 300 seconds will be selected.
# Calculate and select rows with the time difference larger than 300 seconds
gps_stop <- gps
gps_stop <- gps_stop %>%
arrange(id,Timestamp)
gps_stop$tdiff <- unlist(tapply(gps_stop$Timestamp, INDEX = gps_stop$id,
FUN = function(x) c(0, `units<-`(diff(x),"secs"))))
gps_stop <- gps_stop %>%
filter(tdiff>300) %>%
rename(endtime=Timestamp) %>%
mutate(starttime=endtime-tdiff)
gps_stop_sf <- st_as_sf(gps_stop,
coords=c("long","lat"),
crs=4326)
Plot the GPS path and stop-by points for each car, and we use the graph of the car whose ID is 1 as the example.
# Set the tmap mode to "view"
tmap_mode("view")
# Select the dataset for ID 1 car
gps_stop_sf_selected <- gps_stop_sf %>%
filter(id==2)
gps_path_selected <- gps_path %>%
filter(id==2)
# Plot the map, GPS path and stop-by points
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_stop_sf_selected) +
tm_dots(size = 0.075)+
tm_shape(gps_path_selected) +
tm_lines(col = "day",
lwd = 3)
Based on the GPS path and stop-by points for each car, we can reassess the anomalies in 5.2.3. Coffee shops were indeed the popular locations due to frequent appears of stop-by points in coffee shops and someone would go to coffee shops in the evening after leaving the company (may work in coffee shops), as showing the following graph for car ID 6 in the date of 6.
# Select the dataset for ID 6 car in the date of 6
gps_stop_sf_selected_work <- gps_stop_sf %>%
filter(id==6,get_day(endtime)==6)
gps_path_selected_work <- gps_path %>%
filter(id==6,day==6)
# Plot the map, GPS path and stop-by points
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_stop_sf_selected_work) +
tm_dots(size = 0.075)+
tm_shape(gps_path_selected_work) +
tm_lines(col = "day",
lwd = 3)
After ID 6 car leaf GAStech at 17:30 at the date of 2012/1/6, the user of this car arrived Guy’s Gyros (coffee shops) at 18:45 and arrived Katerina’s Caf at 20:14. The user may work in these two coffee shops.
Figure 10
After observing, we knew that there are some purchase records show the same purchase price, purchase locations and purchase date in the dataset for both credit card and loyalty card, so we can speculate that the purchse records were done at the same time by the same person. Hence, we can match credit cards and loyalty cards firstly and then figure out the owners of them.
# Match cards by the same purchase price, purchase locations and purchase date
creditcard_final_day <- creditcard_final
creditcard_final_day$timestamp <- format(creditcard_final_day$timestamp,format="%Y-%m-%d")
loyaltycard_final_day <- loyaltycard_final
loyaltycard_final_day$timestamp <- as.character(loyaltycard_final_day$timestamp)
loyal_creidt <- inner_join(creditcard_final_day,loyaltycard_final_day,
by=c("price","location","timestamp")) %>%
select(last4ccnum,loyaltynum)
# Build dataset for the result of matching
loyal_creidt_unique <- unique(loyal_creidt)
loyal_creidt_unique <- loyal_creidt_unique%>%
group_by(last4ccnum) %>%
summarise(loyaltynum = paste(loyaltynum, collapse = ", "))
We build the network graph to show the relationship between cards and locations according to the purchase records of credit cards.
# Transforming data
creditcard_final_net <- creditcard_final
creditcard_final_net$last4ccnum <- as.character(creditcard_final_net$last4ccnum)
creditcard_final_net$Day <- get_day(creditcard_final_net$timestamp)
creditcard_final_net$Hour <- get_hour(creditcard_final_net$timestamp)
# Creating nodes list
sources <- creditcard_final_net %>%
distinct(last4ccnum) %>%
rename(label = last4ccnum)
destinations <- creditcard_final_net %>%
distinct(location) %>%
rename(label = location)
cc_nodes <- full_join(sources,
destinations,
by = "label")
cc_nodes <- cc_nodes %>%
rowid_to_column("id")
# Creating and tidying edges list
edges <- creditcard_final_net %>%
group_by(last4ccnum, location, Day, Hour) %>%
summarise(weight = n()) %>%
ungroup()
cc_edges <- edges %>%
left_join(cc_nodes,
by = c("last4ccnum" = "label")) %>%
rename(from = id)
cc_edges <- cc_edges %>%
left_join(cc_nodes,
by = c("location" = "label")) %>%
rename(to = id)
cc_edges <- select(cc_edges, from, to,
Day, Hour, weight)
cc_graph <- tbl_graph(nodes = cc_nodes,
edges = cc_edges,
directed = FALSE)
# Plot the interactive network
visNetwork(cc_nodes,
cc_edges) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(highlightNearest = TRUE,
nodesIdSelection = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)
By choosing the credit card ID we can know the locations that this card was used in the given period of two weeks. To know more about the details, like purchase frequency in each location and corresponding loyalty cards, we built a data table that can help us to finding the card owners more accurately.
# Prepare dataset
creditcard_final_loyalty_DT <- creditcard_final %>%
group_by(last4ccnum,location) %>%
summarise(purchasenum=n())
creditcard_final_loyalty_DT <- left_join(creditcard_final_loyalty_DT,
loyal_creidt_unique,
by="last4ccnum")
creditcard_final_loyalty_DT <- creditcard_final_loyalty_DT %>%
arrange(desc(purchasenum))
# Build DT
DT::datatable(creditcard_final_loyalty_DT)
To find the owner of cards we will take the following steps (Use the credit card whose last4ccnum is 4795 as the example):
Figure 11
After comparing the purchase locations with GPS path and stop-by points graphs of all the cars, we can find the Car ID of probable owners because the stop-by points were probably the same as or similar as the purchase locations.
The following GPS path and stop-by points graphs was selected as the best candidate. The car ID of this graph is 17, and the name of car user is Flecha Sven.
Figure 12
Figure 13
Figure 14
Figure 15
Because we needed to compare the purchase locations with GPS path and stop-by points graphs of all the cars, there were potential possibility of judging wrongly and mismatching the owner with the cards.
Long time was needed to judge and match each card and each person.
The lack of geographical coordinates for each locations caused difficulties for us to confirm each location, and we could only find the locations in the map. And if there were geographical coordinates for each locations, we could plot the purchase records in the map directly and then compared with the GPS Path to match owners and cards. It will be more accurate and efficient to complete the task of matching.
The purchase records of some credit cards were strange. For these credit cards, purchase records of one credit card were recorded in different loyalty cards. It seems that the owner of this credit card had two of more loyalty cards or the owner would borrow other persons’ loyalty cards.
The timestamp column of loyalty card dataset is different from the timestamp column of credit card dataset, and it only shows the date but not the time in detail. Hence, we can not match the loyalty card number and credit card number accurately according to the same timestamp value.
To find the relationship of employees, we should plot the network through tree map.
# Now we can plot the tree maps
p6 <- collapsibleTree(carassign_final,
hierarchy = c("CurrentEmploymentType", "CurrentEmploymentTitle","Name"),
root = "GASTech",
width = 800,
fill = c("seashell",
rep("brown", length(unique(carassign_final$CurrentEmploymentType))),
rep("khaki", length(unique(paste(carassign_final$CurrentEmploymentType,
carassign_final$CurrentEmploymentTitle)))),
rep("forestgreen", length(unique(paste(carassign_final$Name,
carassign_final$CurrentEmploymentType))))))
p6
Based on the above graph, the official relationship within employees in GASTech can be got. For example, there were five employees (one is IT helpdesk, three are IT Technician and the remaining one is IT Group Manager) whose work type was relative to Information Technology. Hence, they were likely to be colleagues, and IT Group Manager, Linnea Bergen, was the leader of other four employees.
Figure 16
In order to compare the path of each employee for every day, we divided the GPS path by every day as the following. And then we plotted the maps for all employees and then tried to find suspicious activities and locations. Make car of ID 19 as example.
# Try to plot the needed graphs
# Set the tmap mode to "view"
tmap_mode("view")
# Select the dataset for ID 1 car
gps_stop_sf_selected <- gps_stop_sf %>%
filter(id==19)
gps_stop_sf_selected_day <- gps_stop_sf_selected %>%
mutate(day=get_day(endtime))
gps_stop_sf_selected_day$day <- as.factor(gps_stop_sf_selected_day$day)
gps_path_selected <- gps_path %>%
filter(id==19)
# Plot the map, GPS path and stop-by points
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_stop_sf_selected_day) +
tm_dots(size = 0.5,
col = "day")+
tm_shape(gps_path_selected) +
tm_lines(col = "day",
lwd = 7)+
tm_facets(by = "day",
ncol = 2,
free.scales = FALSE)+
tm_layout(legend.title.size=1,
legend.text.size = 0.6)
Desafio Golf Course seemed to be a popular locations for GASTech employees to relax in weekends, but too many employees went here at the same time in the weekend just before the event of disappear, like the owners of car with ID 4, 28, 31, 32, 35.
According to the GPS records of the car of ID 31, we could know that this owner drove this car seldom. However, the owner drove out from 2014/1/17 to 2014/1/19, and went to the Desafio Golf Course on 2014/1/19 which was the date just before celebration. It was strange for a person who drove less to drive for continuous three days suddenly.
There was a white house that some employees would start the path from here and return here after finishing whole day’s schedual (ID 2, 3, 9, 10, 11, 19 and so on). We could make a conjecture that this white house is a residential zone for these persons. It may not be a suspicious location that related to the disappear event, but it is helpful for us to confirm the function of living for this white house when analyzing the behaviors of employees.
Figure 19
GPS path map showed that many employeew went to Alberts Fine Clothing often, especially in the weekend days of 11, 12, 18 and 19. Though they probably made an appointment to go shopping together, but it was still very abnormal that several employees went to together. At the same time, Alberts Fine Clothing was an extremely popular locations among all the locations because almost every employees have the records of stopping and shopping here, though this clothing shop was far from GASTech. Although they might be buying the clothes for attending the celebration of their company, the extremely high frequency was suspicious.
Figure 21
Especially, the owner of the car with the ID of 7 and 32 went to Alberts Fine Clothing almost every weekday.
In the weekend, many employees, like the owner of ID 2 car, would leave their residential zones and make a special trip to this clothing store.
It was also a very popular locations for employees, but we didn’t make accurate justification about function or type of this shop from its name. Many
Figure 24
Some persons would drive for a bit long time to this location at noon after leave company, like owners of cars of ID 11 and 19.
Figure 27
Figure 28
In this research, we use the tracking data of cars, job information of each employees and purchase records of credit card and loyalty card to complete two main objectives of identifying suspicious behaviors and inferring the owners of each credit card and loyalty card by suitable techniques of visualization. After data wrangling and data extraction, the the most popular five locations and the popular periods were figured out, and then the anomalies, like overtime work and inconvenient meals, were found. Secondly, after adding GPS track data the method of matching owners with credit card and loyalty card was built, and we used one credit card as exmaple to show the steps in detail. At the same time, the relationship among all employees was shown by hierarchical graph based on the job information. Last but not least, according to the GPS path and stop-by points of each person, some suspicious activities and locations were figured out to help to retrieve the truth of disappearance.